Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
Abstract
We present an Earley-style dynamic programming algorithm that parses both sentences of a sentence pair from a parallel corpus simultaneously, building two phrase structure trees and a correspondence mapping between their nodes. The algorithm is intended for bootstrapping grammars for less-studied languages by exploiting the grammatical information implicit in parallel corpora. We therefore presuppose a given (statistical) word alignment underlying the synchronous parsing task, which significantly reduces the parsing complexity. The theoretical complexity results are corroborated by a quantitative evaluation in which we ran an implementation of the algorithm on a suite of test sentences from the Europarl parallel corpus.
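To illustrate why a fixed word alignment can shrink the search space of a synchronous parser, the sketch below counts how many source/target span pairs remain usable once alignment links are not allowed to cross span boundaries. This is not the algorithm from the paper: the consistency criterion, the half-open span convention, and the names `spans_consistent` and `count_consistent_pairs` are assumptions made purely for illustration.

```python
from typing import Iterable, List, Tuple

# A link (i, j) says source word i is aligned to target word j (assumed format).
Link = Tuple[int, int]
Span = Tuple[int, int]  # half-open interval [start, end) over word positions


def spans_consistent(links: Iterable[Link], src: Span, tgt: Span) -> bool:
    """True if no alignment link connects a word inside one span
    to a word outside the other span."""
    s0, s1 = src
    t0, t1 = tgt
    for i, j in links:
        in_src = s0 <= i < s1
        in_tgt = t0 <= j < t1
        if in_src != in_tgt:  # the link crosses the span-pair boundary
            return False
    return True


def count_consistent_pairs(links: Iterable[Link], n_src: int, n_tgt: int) -> Tuple[int, int]:
    """Return (kept, total): span pairs that respect the alignment,
    and all span pairs an unconstrained parser would have to consider."""
    link_list: List[Link] = list(links)
    kept = 0
    total = 0
    for s0 in range(n_src):
        for s1 in range(s0 + 1, n_src + 1):
            for t0 in range(n_tgt):
                for t1 in range(t0 + 1, n_tgt + 1):
                    total += 1
                    if spans_consistent(link_list, (s0, s1), (t0, t1)):
                        kept += 1
    return kept, total


if __name__ == "__main__":
    # Toy example: three words on each side, aligned one-to-one in order.
    kept, total = count_consistent_pairs([(0, 0), (1, 1), (2, 2)], 3, 3)
    print(f"{kept} of {total} span pairs respect the alignment")
```

On the toy input, each source span is compatible with exactly one target span, so only a small fraction of the span pairs survives. A real parser would apply such a filter inside its chart operations rather than by enumerating all pairs as done here.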
Similar resources
Unsupervised Learning for Natural Language Processing
Given the abundance of text data, unsupervised approaches are very appealing for natural language processing. We present three latent variable systems which achieve state-of-the-art results in domains previously dominated by fully supervised systems. For syntactic parsing, we describe a grammar induction technique which begins with coarse syntactic structures and iteratively refines them in an ...
Experiments in parallel-text based grammar induction
This paper discusses the use of statistical word alignment over multiple parallel texts for the identification of string spans that cannot be constituents in one of the languages. This information is exploited in monolingual PCFG grammar induction for that language, within an augmented version of the inside-outside algorithm. Besides the aligned corpus, no other resources are required. We discu...
A Gibbs Sampler for Phrasal Synchronous Grammar Induction
We present a phrasal synchronous grammar model of translational equivalence. Unlike previous approaches, we do not resort to heuristics or constraints from a word-alignment model, but instead directly induce a synchronous grammar from parallel sentence-aligned corpora. We use a hierarchical Bayesian prior to bias towards compact grammars with small translation units. Inference is performed usin...
The Impact of Grammar Enhancement on Semantic Resources Induction
This paper describes the effects of the evolution of an Italian dependency grammar on a task of multilingual FrameNet acquisition. The task is based on the creation of virtual English/Italian parallel annotation corpora, which are then aligned at dependency level by using two manually encoded grammar based dependency parsers. We show how the evolution of the LAS (Labeled Attachment Score) me...
Learning To Parse on Aligned Corpora
One of the first big hurdles that mathematicians encounter when considering writing formal proofs is the necessity to get acquainted with the formal terminology and the parsing mechanisms used in the large ITP libraries. This includes the large number of formal symbols, the grammar of the formal languages and the advanced mechanisms instrumenting the proof assistants to correctly understand the...